1 Educational attainment by race

1.1 Data setup

acs_vars_2018_5yr = listCensusMetadata(name = "2018/acs/acs5",
                                       type = "variables")
saveRDS(acs_vars_2018_5yr, file = "acs_vars_2018_5yr.rds")

acs_vars_2019_1yr = listCensusMetadata(name = "2019/acs/acs1",
                                       type = "variables")
saveRDS(acs_vars_2019_1yr, file = "acs_vars_2019_1yr.rds")
acs_vars_2018_5yr = readRDS("acs_vars_2018_5yr.rds")
acs_vars_2019_1yr = readRDS("acs_vars_2019_1yr.rds")

census_race_labels = c("White Alone",
                       "Black or African American",
                       "American Indian and Alaska Native Alone",
                       "Asian Alone",
                       "Native Hawaiian and Other Pacific Islander Alone",
                       "Some Other Race Alone",
                       "Two or More Races")
bay_educ_race = 1:7 %>%
  map_dfr(function(x) {
    getCensus(name = "acs/acs1",
              vintage = 2019,
              region = "county:085",
              regionin = "state:06",
              vars = paste0("group(B15002", LETTERS[x], ")")) %>%
    select(!c(GEO_ID, state, NAME) &
           !ends_with(c("EA", "MA", "M"))) %>%
    pivot_longer(ends_with("E"),
                 names_to = "variable",
                 values_to = "estimate") %>%
    left_join(acs_vars_2019_1yr %>% select(name, label),
                                    by = c("variable" = "name")) %>%
    select(-variable) %>%
    separate(label,
             into = c(NA, NA, "sex", "education"),
             sep = "!!") %>%
    filter(!is.na(education)) %>%
    mutate(race = census_race_labels[x])
    })
bay_educ_race

1.2 Graphing educational attainment by race

bay_race_total = bay_educ_race %>%
                 group_by(race) %>%
                 summarize(estimate = sum(estimate)) %>%
                 mutate(education = "Total")
bay_educ_race %>% group_by(education, race) %>%
                  summarize(estimate = sum(estimate)) %>%
                  rbind(bay_race_total) %>%
                  ggplot() +
                  geom_bar(aes(x = education %>%
                                   factor(levels = rev(c("Total",
                                                         bay_educ_race$education[1:8]))),
                               y = estimate,
                               fill = race),
                           stat = "identity",
                           position = "fill") +
                  labs(x = "Educational attainment",
                       y = "Proportion of households",
                       title = "Santa Clara County educational attainment by race",
                       subtitle = "Population 25 years or older",
                       fill = "Race of householder") +
                  coord_flip() +
                  theme(legend.position = "bottom",
                        legend.direction = "vertical")

1.3 Stacked chart

bay_race_total = bay_educ_race %>%
                 group_by(race) %>%
                 summarize(estimate = sum(estimate)) %>%
                 mutate(education = "Total")
bay_educ_race %>% group_by(education, race) %>%
                  summarize(estimate = sum(estimate)) %>%
                  rbind(bay_race_total) %>%
                  ggplot() +
                  geom_bar(aes(x = education %>%
                                   factor(levels = rev(c("Total",
                                                         bay_educ_race$education[1:8]))),
                               y = estimate,
                               fill = race),
                           stat = "identity") +
                  labs(x = "Educational attainment",
                       y = "Number of households",
                       title = "Santa Clara County educational attainment by race",
                       subtitle = "Population 25 years or older",
                       fill = "Race of householder") +
                  coord_flip() +
                  theme(legend.position = "bottom",
                        legend.direction = "vertical")

1.4 Analysis

  • There is no education data available for Native Hawaiians and other Pacific Islanders. Possibly that is because this population is very small in the Bay Area.
  • As the stacked chart shows, most Bay Area residents possess at least a bachelor’s degree. Smaller numbers of residents possess some college with no degree, and smaller still numbers possess a high school diploma alone.
  • “Some Other Race Alone” (unspecified by definition, but I am guessing these are primarily Hispanics/Latinos) is disproportionately represented among lower educational attainment levels (GED/high school diploma or less).
  • Asians are overrepresented among those with bachelor’s and graduate/professional degrees.
  • Interestingly enough, white people are overrepresented among those with high school diplomas, GEDs or alternative credentials, and those with some college and no degree.

We can observe some more interesting phenomena by flipping our analysis, that is, looking at the composition of educational attainment for every race.

1.5 Flipping the axes: Race by educational attainment

bay_educ_race %>% group_by(education, race) %>%
                  summarize(estimate = sum(estimate)) %>%
                  ggplot() +
                  geom_bar(aes(x = race,
                               y = estimate,
                               fill = education %>%
                                      factor(levels = rev(c("Total",
                                                            bay_educ_race$education[1:8])))),
                           stat = "identity",
                           position = "fill") +
                  labs(x = "Proportion of households",
                       y = "Educational attainment",
                       title = "Santa Clara County race by educational attainment",
                       subtitle = "Population 25 years or older",
                       fill = "Educational attainment") +
                  coord_flip() +
                  theme(legend.position = "bottom",
                        legend.direction = "vertical")

1.6 Analysis

  • There seems to be insufficient data among Native Hawaiian and Pacific Islander populations to plot their data.
  • Over half of Asians and whites hold a bachelor’s degree or higher.
  • On the other hand, between 25% and 50% of Blacks and less than 25% of American Indians and Alaska Natives and those of “Some Other Race Alone”.

These statistics testify to the racial stratification in Santa Clara County jobs. In the tech sector in particular, whites and Asians are overrepresented, while across all professions requiring a college degree, whites are overrepresented.

2 Number and percentage of K-12 students without Internet access

2.1 Data setup

ca_pums = get_pums(variables = c("PUMA", "ACCESS", "SCHG"),
                   state = "CA",
                   year = 2019,
                   survey = "acs1",
                   recode = T)
saveRDS(ca_pums, file = "ca_pums.rds")
pums_vars_2019 = pums_variables %>%
                 filter(year == 2019, survey == "acs1")
pums_vars_2019_inds = pums_vars_2019 %>%
                      distinct(var_code, var_label, data_type, level) %>%
                      filter(level == "person")
ca_pums = readRDS("ca_pums.rds")
ca_pumas = pumas("CA", cb = T, progress_bar = F)

# Isolate Santa Clara County
ca_counties = counties("CA", cb = T, progress_bar = F)
county_name = c("Santa Clara")
sc_county = ca_counties %>%
            filter(NAME %in% county_name)

sc_pumas = ca_pumas %>%
           st_centroid() %>%
           .[sc_county, ] %>%
           mutate(PUMACE10 = as.numeric(PUMACE10)) %>%
           st_set_geometry(NULL) %>%
           left_join(ca_pumas %>% select(GEOID10)) %>%
           st_as_sf()
sc_pums = ca_pums  %>%
          mutate(PUMA = as.numeric(PUMA)) %>%
          filter(PUMA %in% sc_pumas$PUMACE10)
# Generate number of K-12 students without internet
sc_internet = sc_pums %>%
              filter(ACCESS != "b" &
                     SCHG != "bb") %>%
              mutate(ACCESS = as.numeric(ACCESS),
                     SCHG = as.numeric(SCHG)) %>%
              filter(!duplicated(SERIALNO) &
                     (SCHG >= 2 &
                      SCHG <= 14)) %>%
              mutate(no_internet = ifelse(
                      (ACCESS == 3),
                      PWGTP,
                      0)) %>%
              group_by(PUMA) %>%
              left_join(sc_pumas %>% select(PUMACE10),
                                     by = c("PUMA" = "PUMACE10"))
# Print number
sum(sc_internet$no_internet, na.rm = T)
## [1] 2752
# Total children being tracked
sum(sc_internet$PWGTP, na.rm = T)
## [1] 163553
# Percentage of children
sum(sc_internet$no_internet, na.rm = T) / 
  (sum(sc_internet$PWGTP, na.rm = T))
## [1] 0.01682635

According to the ACS 2018 5-year results, 1.9% of K-12 students in Santa Clara County (35 out of 1835) do not have Internet access. There are a few important caveats to note about the data.

First, in the ACS 2018 and 2019 1-year results, when I ran this same code I ended up with 0 students without Internet access, which I highly doubt is the case — it’s more likely that I got that result because the 1-year samples only draw from highly populated areas, and in Santa Clara County the people least likely to have internet access will likely live in sparsely populated areas not covered by the 1-year sample. This issue suggests that less populated areas that are less likely to have internet access are also less likely to be sampled, and thus that this percentage is probably an underestimate.

A related problem is that the sample of K-12 children is very tiny to begin with. Estimates made from small samples are bound to have a large margin of error: another reason to take this estimate with a grain of salt.

Finally, the 2018 5-year data might not be very useful for understanding the state of internet access in 2020. The 2018 5-year data by definition draws on a sample from 2013-18, and internet access has broadened greatly even just between 2018 and 2020, let alone from 2013 to 2020.

2.2 Mapping internet access across the Bay Area

2.2.1 Data setup

# Broaden PUMS data to all Bay Area
bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )
bay_counties = ca_counties %>%
               filter(NAME %in% bay_county_names)
bay_pumas = ca_pumas %>%
            st_centroid() %>%
            .[bay_counties, ] %>%
            mutate(PUMACE10 = as.numeric(PUMACE10)) %>%
            st_set_geometry(NULL) %>%
            left_join(ca_pumas %>% select(GEOID10)) %>%
            st_as_sf()
bay_pums = ca_pums %>%
           mutate(PUMA = as.numeric(PUMA)) %>%
           filter(PUMA %in% bay_pumas$PUMACE10)

# Generate number of K-12 students without internet
bay_internet = bay_pums %>%
               filter(ACCESS != "b" &
                      SCHG != "bb") %>%
               mutate(ACCESS = as.numeric(ACCESS),
                      SCHG = as.numeric(SCHG)) %>%
               filter(!duplicated(SERIALNO) &
                      (SCHG >= 2 &
                      SCHG <= 14)) %>%
               mutate(no_internet = ifelse(
                       (ACCESS == 3),
                       PWGTP,
                       0)) %>%
               group_by(PUMA) %>%
               mutate(perc_nointernet = sum(no_internet, na.rm = T)/
                                        (sum(PWGTP, na.rm = T))*100) %>%
               left_join(bay_pumas %>% select(PUMACE10),
                                       by = c("PUMA" = "PUMACE10")) %>%
               st_as_sf()

2.2.2 Map

pums_pal = colorNumeric(palette = "Oranges",
                        domain = bay_internet$perc_nointernet)

leaflet() %>%
  addTiles() %>%
  addPolygons(data = bay_internet,
              fillColor = ~pums_pal(perc_nointernet),
              color = "white",
              opacity = 0.5,
              fillOpacity = 0.5,
              weight = 1,
              label = ~paste0(round(perc_nointernet),
                              "% K-12 students without Internet access"),
              highlightOptions = highlightOptions(weight = 2,
                                                  opacity = 1)) %>%
  addLegend(data = bay_internet,
            pal = pums_pal,
            values = ~perc_nointernet,
            title = "% K-12 students<br>without Internet access")

3 Migration by education

3.1 Data setup

acs_vars_2019_1yr = listCensusMetadata(name = "2019/acs/acs1",
                                       type = "variables")
sc_mobility_current_19 = getCensus(name = "acs/acs1",
                                   vintage = 2019,
                                   region = "county:085",
                                   regionin = "state:06",
                                   vars = c("group(B07009)")) %>%
                          select(!c(GEO_ID, state, NAME) &
                                 !ends_with(c("EA", "MA", "M"))) %>%
                          pivot_longer(ends_with("E"),
                                       names_to = "variable",
                                       values_to = "estimate") %>%
                          left_join(acs_vars_2019_1yr %>% select(name, label),
                                    by = c("variable" = "name")) %>%
                          select(-variable) %>%
                          separate(label,
                                   into = c(NA, NA, "mobility", "education"),
                                   sep = "!!") %>%
                          mutate(mobility = ifelse(
                                   mobility %in% c("Same house 1 year ago:",
                                                   "Moved within same county:"),
                                   "Here since last year",
                                   "Inflow")) %>%
                          filter(!is.na(education)) %>%
                          group_by(mobility, education) %>%
                          summarize(estimate = sum(estimate))

sc_mobility_lastyear_19 = getCensus(name = "acs/acs1",
                                    vintage = 2019,
                                    region = "county:085",
                                    regionin = "state:06",
                                    vars = c("group(B07409)")) %>%
                           select(!c(GEO_ID, state, NAME) &
                                 !ends_with(c("EA", "MA", "M"))) %>%
                           pivot_longer(ends_with("E"),
                                       names_to = "variable",
                                       values_to = "estimate") %>%
                           left_join(acs_vars_2019_1yr %>% select(name, label),
                                    by = c("variable" = "name")) %>%
                           select(-variable) %>%
                           separate(label,
                                    into = c(NA, NA, "mobility", "education"),
                                    sep = "!!") %>%
                           mutate(mobility = ifelse(
                                    mobility %in% c("Same house:",
                                                    "Moved within same county:"),
                                    "Here since last year",
                                    "Outflow")) %>%
                           filter(!is.na(education)) %>%
                           group_by(mobility, education) %>%
                           summarize(estimate = sum(estimate))

sc_mobility_current_18 = getCensus(name = "acs/acs1",
                                    vintage = 2018,
                                    region = "county:085",
                                    regionin = "state:06",
                                    vars = c("group(B07009)")) %>%
                          select(!c(GEO_ID, state, NAME) &
                                 !ends_with(c("EA", "MA", "M"))) %>%
                          pivot_longer(ends_with("E"),
                                       names_to = "variable",
                                       values_to = "estimate") %>%
                          left_join(acs_vars_2019_1yr %>% select(name, label),
                                    by = c("variable" = "name")) %>%
                          select(-variable) %>%
                          separate(label,
                                   into = c(NA, NA, "mobility", "education"),
                                   sep = "!!") %>%
                          mutate(mobility = "Here last year") %>%
                          filter(!is.na(education)) %>%
                          group_by(mobility, education) %>%
                          summarize(estimate = sum(estimate))

sc_flows_19 = rbind(sc_mobility_current_18,
                     sc_mobility_lastyear_19 %>%
                       filter(mobility == "Outflow"),
                     sc_mobility_current_19 %>%
                       filter(mobility == "Inflow"),
                     sc_mobility_current_19 %>%
                       group_by(education) %>%
                       summarize(estimate = sum(estimate)) %>%
                       mutate(mobility = "Here this year")) %>%
                     pivot_wider(names_from = mobility,
                                 values_from = estimate) %>%
                     mutate(`External net` = Inflow - Outflow,
                            `Internal net` = `Here this year` -
                                             `Here last year` -
                                             `External net`,) %>%
                     select(`Education level` = education,
                            `Internal net`,
                            `External net`,
                            `Here last year`,
                            `Here this year`,
                            Outflow,
                            Inflow)

3.2 Table

# Rearrange data to be in order of educational attainment
attainment_ordered = c("Less than high school graduate",
                       "High school graduate (includes equivalency)",
                       "Some college or associate's degree",
                       "Bachelor's degree",
                       "Graduate or professional degree")

sc_flows_19 %>%
  mutate(`Education level` = factor(`Education level`,
                                    levels = attainment_ordered)) %>%
  arrange(`Education level`)

“Internal net” refers to changes in this category not due to migration into and out of the county; “external net” refers to changes that are due to migration into and out of the county.

Looking at the “Internal Net” column, it appears that the number of less-educated people grew for reasons not related to immigration or emigration from the county, while numbers of more-educated people fell. This is a bizarre finding since it is not generally possible for people to lose education, but one possible explanation is a large number of survey respondents were high school and college students being counted for the first time (the ACS counts people from age 15 upwards). Another possible explanation that an unusually large number of people with bachelor’s and above degrees died from 2018 to 2019.

Looking at the “External Net” column, it seems that people of most educational levels are emigrating from the county except for the extremely well educated (those with a graduate/professional degree) and the least educated (high school graduates). This pattern squares more with conventional logic about Bay Area migration patterns and suggests that the Bay Area economy tend to draw both highly-educated workers (who work as researchers, tech workers, or other highly-skilled occupations), and less-educated workers (who work in service-oriented occupations). Overall, the data suggests that to some extent, these extremely well-educated and less-educated people are pushing out people who are moderately educated. However, this effect is limited, since the total outflow (about 4,000 people) is much less than the total inflow (about 11,000 people).